In this lesson, we'll explore the Boston housing dataset (which is built into sklearn) and walk through some basic principles of setting up, building, tuning, and selecting a valid machine learning model.
This lesson will use sklearn in conjunction with several skutil preprocessing techniques.
In [2]:
from __future__ import print_function, division
import sklearn
import skutil
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
print('sklearn: %s' % sklearn.__version__)
print('skutil: %s' % skutil.__version__)
print('pandas: %s' % pd.__version__)
print('numpy: %s' % np.__version__)
In [3]:
from sklearn.datasets import load_boston
boston = load_boston()
X = pd.DataFrame.from_records(data=boston.data, columns=boston.feature_names)
X.head()
Out[3]:
Let's examine the first few values of our target variable. Notice the value is a real number and not a class label, indicating we'll be using regression models rather than classification.
In [4]:
y = boston.target
y[:5]
Out[4]:
By examining the dtypes (data types) attribute of the dataframe, our suspicion is confirmed: all of the features are in fact numeric. Below, we also take a look at whether there are any missing values. Luckily, in this example there are not.
In [5]:
X.dtypes
Out[5]:
In [6]:
X.isnull().sum()
Out[6]:
In [7]:
X.describe()
Out[7]:
Rarely, if ever, will your data be ready to model without any preprocessing. Whether it's noisy features, skewed variables, or redundant or uninformative features, your data will almost always need some massaging.
One thing to be aware of is whether your input is shuffled or in order; if your data is ordered, know why. For example, let's build a RandomForestRegressor on the first few rows of X and validate the model on the last rows:
In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
def rmse(act, pred):
    return np.sqrt(mean_squared_error(act, pred))
# define the model
model = RandomForestRegressor(random_state=42)
# fit the model
model.fit(X[:350], y[:350])
# assess performance
print('Train R^2: %.5f' % r2_score(y[:350], model.predict(X[:350])))
print('Train RMSE: %.5f\n' % rmse(y[:350], model.predict(X[:350])))
print('Test R^2: %.5f' % r2_score(y[350:], model.predict(X[350:])))
print('Test RMSE: %.5f' % rmse(y[350:], model.predict(X[350:])))
Notice the extreme drop-off in validation performance! There are likely phenomena in the test rows that were never observed in the training rows, so the model was never induced to capture those patterns.
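One quick way to sanity-check this (a simple diagnostic sketch, not part of the original lesson) is to compare summary statistics of the two slices; if the data is ordered, the target distribution in the first rows will often look quite different from the last rows:

# compare the target distribution in the slice we trained on vs. the slice we validated on
print('target mean, first 350 rows: %.3f' % np.mean(y[:350]))
print('target mean, remaining rows: %.3f' % np.mean(y[350:]))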
train_test_split
The sooner you can split your data, the better. sklearn provides a built-in mechanism for just this: sklearn.cross_validation.train_test_split. This will split your data to the specified sizes and shuffle the observations at the same time.
Notice we create three splits:
In [9]:
from sklearn.cross_validation import train_test_split
tr_size = int(0.6 * X.shape[0])
va_te_size = int((X.shape[0] - tr_size) / 2)
# split the train/val and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=va_te_size)
# split the train/val apart
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=va_te_size)
print('Train size: %i' % X_train.shape[0])
print('Validation size: %i' % X_val.shape[0])
print('Holdout size: %i' % X_test.shape[0])
In [10]:
# fit the model
model.fit(X_train, y_train)
# assess performance
print('Train R^2: %.5f' % r2_score(y_train, model.predict(X_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, model.predict(X_train)))
print('Val R^2: %.5f' % r2_score(y_val, model.predict(X_val)))
print('Val RMSE: %.5f' % rmse(y_val, model.predict(X_val)))
Notice that our validation performance is now much more similar to our training performance.
Note: It is bad practice to evaluate your model against your test set while modeling, so we use our validation set to examine incremental performance.
How can we make this model perform better? There may be some strange/skewed distributions within our data that we could coerce into a more normal shape. Let's take a look at just a few (you could do this for all features, but for the sake of example, we'll only look at a handful).
In [11]:
# start by defining a very simple histogram function
def hist(x, scale = 1, style = 'darkgrid', left = None, right = None, xlab='Count', ylab='Y'):
x = x if not isinstance(x, pd.Series) else x.tolist()
figure = plt.figure()
sns.set(style = style)
sns.distplot(x * scale, hist = True, kde = False, norm_hist = True)
ax = figure.get_axes()[0]
ax.set_xlim(left = left or int(np.ceil(np.min(x))),
right = right or int(np.ceil(np.max(x))))
ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
Notice the crime feature is quite skewed. We may be able to make it more normal with a BoxCoxTransformer. We can apply this technique to other features as well, but it is not always guaranteed to work well.
In [12]:
hist(x=X_train.CRIM, ylab='Crime rate')
In [14]:
from skutil.preprocessing import BoxCoxTransformer
hist(BoxCoxTransformer(cols=['CRIM']).fit_transform(X_train).CRIM.tolist(), ylab='Crime transformed')
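For comparison, a similar transform can be sketched with scipy's scipy.stats.boxcox, which fits the power parameter by maximum likelihood. This is only a rough equivalent: it assumes CRIM is strictly positive (which Box-Cox requires), and the fitted lambda may differ from skutil's estimate.

from scipy.stats import boxcox
# fit lambda by maximum likelihood and transform the raw crime rates
crim_bc, crim_lambda = boxcox(X_train.CRIM.values)
hist(crim_bc, ylab='Crime (scipy Box-Cox)')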
How do we know which features to retain? In this toy example we have a manageable number of features, but in domains like text analytics or computer vision we often have more than 100,000. Let's explore some techniques for reducing this high dimensionality without hurting the predictive power of our model (in no particular order):
1. Eliminate multicollinearity:
In [15]:
from skutil.feature_selection import MulticollinearityFilterer
# let's see if any features are collinear with one another:
fltr = MulticollinearityFilterer(threshold=0.9).fit(X_train)
# examine the drop attribute
fltr.drop
Out[15]:
The MulticollinearityFilterer searches the correlation matrix for any correlations greater than the provided threshold. When a high correlation is observed between two features, it compares each feature's mean absolute correlation with the remaining features and drops the one that is more highly correlated overall.
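To make the mechanics concrete, here is a naive pandas sketch of that idea (an illustration only, not skutil's actual implementation; naive_mc_filter is a hypothetical helper):

def naive_mc_filter(frame, threshold=0.9):
    # pairwise absolute correlations between features
    corr = frame.corr().abs()
    drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                # drop whichever of the pair has the higher
                # mean absolute correlation overall
                drop.add(a if corr[a].mean() > corr[b].mean() else b)
    return sorted(drop)

naive_mc_filter(X_train, threshold=0.9)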
2. Eliminate features with near zero variance:
In [16]:
from skutil.feature_selection import NearZeroVarianceFilterer
# define and fit the filterer
fltr = NearZeroVarianceFilterer(threshold=1e-4).fit(X_train)
# examine the dropped cols
fltr.drop
Notice there are no features with variance less than the threshold, so the result was None. If we wanted, we could adjust that threshold.
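For reference, a similar check can be sketched with plain sklearn's VarianceThreshold. This is a comparison only: it operates on numpy arrays rather than DataFrames, and its variance calculation may differ slightly from skutil's.

from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=1e-4).fit(X_train)
# names of any columns whose variance falls below the threshold
X_train.columns[~vt.get_support()].tolist()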
3. PCA (Principal Component Analysis)
(Note that this isn't actually a feature selection technique, but a feature reduction technique that results in a set of features which are linear combinations of the original input space)
In [17]:
from skutil.decomposition import SelectivePCA
# define and fit
pca = SelectivePCA(n_components=0.85).fit(X_train)
# examine the head
pca.transform(X_train).head()
Out[17]:
In [20]:
from skutil.preprocessing import SelectiveScaler
# multicollinearity
mcf = MulticollinearityFilterer(threshold=0.9).fit(X_train)
mcf_train = mcf.transform(X_train)
# near zero variance
nzv = NearZeroVarianceFilterer(threshold=1e-4).fit(mcf_train)
nzv_train = nzv.transform(mcf_train)
# add a step: scaling
scl = SelectiveScaler().fit(nzv_train)
scl_train = scl.transform(nzv_train)
# pca
pca = SelectivePCA(n_components=0.85).fit(scl_train)
pca_train = pca.transform(scl_train)
# fit the model
model.fit(pca_train, y_train)
# assess performance on validation set
print('Train R^2: %.5f' % r2_score(y_train, model.predict(pca_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, model.predict(pca_train)))
That's nice, but it's kind of a mess. What if we have a ton of preprocessors to keep track of? Things could get hairy. Furthermore, if we want to assess performance on our validation set, we have to recompute just as many intermediate transformations. What a pain!
That's what the Pipeline object is for. A Pipeline stores a sequence of named transformers, with an optional estimator as the last element. The only required argument to the Pipeline constructor is a single list of (name, transformer) tuples:
pipe = Pipeline([
    ('name_of_first_step', FirstTransformer()),
    ('name_of_second_step', SecondTransformer())
])
In [21]:
from sklearn.pipeline import Pipeline
# define our pipe
pipe = Pipeline([
    ('mc', MulticollinearityFilterer(threshold=0.9)),
    ('nzv', NearZeroVarianceFilterer(threshold=1e-4)),
    ('sc', SelectiveScaler()),
    ('pca', SelectivePCA(n_components=0.85)),
    ('rf', RandomForestRegressor(random_state=42))
])
# fit our pipeline
pipe.fit(X_train, y_train)
# assess performance
print('Train R^2: %.5f' % r2_score(y_train, pipe.predict(X_train)))
print('Train RMSE: %.5f\n' % rmse(y_train, pipe.predict(X_train)))
print('Validation R^2: %.5f' % r2_score(y_val, pipe.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, pipe.predict(X_val)))
Notice we get the exact same results, but the code is much more elegant with fewer intermediate variables lying around. We can also assess performance on our validation set.
However, on closer inspection, we can see that our results are not as good as they were before preprocessing. Presumably, if we could tune our preprocessing hyperparameters to better suit the algorithm, we could identify a model with better performance. Furthermore, the astute reader will note that we have not yet introduced any cross validation:
In [22]:
from sklearn.cross_validation import KFold
# the default sklearn cross validation does NOT shuffle, and you know how we feel about that...
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)
Now we introduce the randomized grid search, the mechanism by which we will sample hyperparameter combinations from a specified space, building cross-validated models at each iteration and retaining the one that performs best.
In [23]:
# make sure to use the SKUTIL grid search for DataFrame compatibility, and not the SKLEARN one.
from skutil.grid_search import RandomizedSearchCV
from scipy.stats import uniform, randint
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
# set our CV
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)
# define our pipe -- note we do not include a Box-Cox step
pipe = Pipeline([
    ('mc', MulticollinearityFilterer()),
    ('nzv', NearZeroVarianceFilterer()),
    ('sc', SelectiveScaler()),
    ('pca', SelectivePCA()),
    ('rf', RandomForestRegressor(random_state=42))
])
# let's define the hyperparameters we'll search over. Notice the form of:
# '<stage_nm>__<arg_nm>'
hyperparams = {
    'mc__threshold'        : uniform(0.95, 0.05),
    'nzv__threshold'       : [1e-4, 1e-2],
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : randint(4, X.shape[1]),
    'pca__whiten'          : [True, False],
    'rf__n_estimators'     : randint(50, 100),
    'rf__max_depth'        : randint(4, 15),
    'rf__min_samples_leaf' : randint(1, 10),
    'rf__min_samples_split': randint(2, 5),
    'rf__max_features'     : uniform(loc=.5, scale=.5),
    'rf__max_leaf_nodes'   : randint(10, 50)
}
# define and fit
search = RandomizedSearchCV(pipe,
                            hyperparams,
                            cv=custom_cv,
                            scoring='r2',
                            random_state=42,
                            n_iter=30)
search.fit(X_train, y_train)
# assess performance
print('Validation R^2: %.5f' % r2_score(y_val, search.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, search.predict(X_val)))
We can actually view our grid results like so:
In [24]:
from skutil.utils import report_grid_score_detail
report_grid_score_detail(random_search=search, charts=True)
Out[24]:
In viewing these results, we can make educated decisions on refining our grid such that we don't waste time searching over parameters that detrimentally impact the performance.
In [25]:
# set our CV
custom_cv = KFold(n=y_train.shape[0], n_folds=5, shuffle=True, random_state=42)
# define our pipe
pipe = Pipeline([
    ('mc', MulticollinearityFilterer()),
    ('nzv', NearZeroVarianceFilterer()),
    ('sc', SelectiveScaler(scaler=StandardScaler())),  # default scaler; the grid still searches over several
    ('pca', SelectivePCA()),
    ('rf', RandomForestRegressor(random_state=42))
])
# we can narrow our search parameters now
hyperparams = {
    'mc__threshold'        : uniform(0.95, 0.05),
    'nzv__threshold'       : [1e-4, 1e-2],
    'sc__scaler'           : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'    : randint(8, X.shape[1]),
    'pca__whiten'          : [True, False],
    'rf__n_estimators'     : randint(75, 100),
    'rf__max_depth'        : randint(4, 15),
    'rf__min_samples_leaf' : randint(1, 8),
    'rf__min_samples_split': randint(2, 5),
    'rf__max_features'     : uniform(loc=.5, scale=.5),
    'rf__max_leaf_nodes'   : randint(25, 50)
}
# define and fit
search = RandomizedSearchCV(pipe,
                            hyperparams,
                            cv=custom_cv,
                            scoring='r2',
                            random_state=42,
                            n_iter=30)
search.fit(X_train, y_train)
# assess performance
print('Validation R^2: %.5f' % r2_score(y_val, search.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, search.predict(X_val)))
In [26]:
from sklearn.ensemble import GradientBoostingRegressor
# define our pipe
gbm_pipe = Pipeline([
    ('mc', MulticollinearityFilterer(threshold=0.9)),
    ('nzv', NearZeroVarianceFilterer(threshold=1e-4)),
    ('sc', SelectiveScaler()),
    ('pca', SelectivePCA(n_components=0.85)),
    ('gbm', GradientBoostingRegressor(random_state=42))
])
# let's define the hyperparameters we'll search over.
gbm_hyperparams = {
    'mc__threshold'      : uniform(0.80, 0.15),
    'sc__scaler'         : [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'pca__n_components'  : uniform(0.95, 0.05),
    'gbm__n_estimators'  : randint(90, 200),
    'gbm__learning_rate' : uniform(0.075, 0.05),
    'gbm__max_depth'     : randint(2, 7),  # we grow these trees more shallow
}
# define and fit
gbm_search = RandomizedSearchCV(gbm_pipe,
                                gbm_hyperparams,
                                cv=custom_cv,
                                scoring='r2',
                                random_state=42,
                                n_iter=30)
gbm_search.fit(X_train, y_train)
# assess performance
print('Validation R^2: %.5f' % r2_score(y_val, gbm_search.predict(X_val)))
print('Validation RMSE: %.5f\n' % rmse(y_val, gbm_search.predict(X_val)))
Wow, that looks fantastic! However, GBMs are much more likely to overfit the data. We'll need to see how it performs on the holdout set to determine whether it's actually a good model.
This happens only once: when you've built a selection of candidate models, evaluate each of them a single time against the holdout set to make your final model selection.
In [27]:
# examine RF performance
print('RF test R^2: %.5f' % r2_score(y_test, search.predict(X_test)))
print('RF test RMSE: %.5f\n' % rmse(y_test, search.predict(X_test)))
# examine GBM performance
print('GBM test R^2: %.5f' % r2_score(y_test, gbm_search.predict(X_test)))
print('GBM test RMSE: %.5f' % rmse(y_test, gbm_search.predict(X_test)))
For our RF, the validation error closely resembles the holdout error, which suggests we are not overfitting. Our GBM, however, indicates otherwise. We could tune its hyperparameters to address this, but for the sake of example, we won't in this demo.
See the next demo for information on how to persist models to disk.
In [ ]: